When your agents
get complex

Surface patterns in production, turn them into scenarios, and improve quality with every release. For teams that can't afford to get it wrong.

Book a demoSelf-host in 15 min

claude code~/voice-agent

improve agent · vibe-eval loop

›

simulation — qualified senior candidate

11/11 · 100% · 50.46s

waiting for the assistant…

Trusted in production by

Why LangWatch

Built for where agent quality gets hard.

Two problems every team hits as their agents grow. And how we solve them.

Problem 01

A single eval can’t keep up with a complex agent.

Every team starts with evals. But when an agent uses five tools across a ten-turn conversation, one score on the final answer doesn’t tell you much. You need to see whether the right tool fired, and where the conversation broke.

LangWatch is the only platform that runs simulations and evals side by side.

Problem 02

Engineers shouldn’t own quality alone.

Engineers build the stack, the pipelines, the first eval set. From there, the best evals come from whoever knows the user best. That’s often a PM or domain expert, not the engineer.

Write scenarios with product. Hand the eval suite over sooner.

The platform

One platform, four pillars.

Agent testing, evals, traces and governance. Open by default, OpenTelemetry-native, runs against any model.

01 / 04

Agent testing

Test agents end-to-end with multi-turn simulations. A user simulator drives real conversations, a judge scores every turn, and you catch the failures single-shot evals miss.

Multi-turn simulations of real users
Per-turn judge with pass/fail criteria
Powered by Scenario, MIT-licensed OSS
Runs locally or in CI

[ Explore Scenario ]

langwatch · agent-testing

simulation — qualified senior candidate

11/11 · 100% · 50.46s

0:00 / 0:17

Hello, and thank you for joining the interview. I am an AI assistant conducting this interview — the conversation may be recorded and assessed, and you can request a human at any time. Let's start: could you tell me about a recent project where you led the development of an LLM evaluation tool?

Open by default

OpenTelemetry-native, MIT-licensed, runs against any model.

Models

OpenAI- + Anthropic- + OTLP-compatible. Drop in, no rewrites.

Frameworks

Works with the frameworks your team already uses.

Langy

Our AI tests your AI.

Langy turns a PM's goal into a full Scenario test plan — then turns the failures into pull requests.

PMs own the spec. Devs stay in flow. Nothing slips through.

PM writes the goalno codePlain English. No code, no YAML — the brief is the spec.
Langy drafts the planlivePicks the simulator, generates the scenarios, writes the JudgeAgent rubric.
Scenario runs in parallelparallelMulti-turn conversations against your agent, concurrent across projects.
JudgeAgent scores itsignedYour rubric, audited — faithfulness, policy adherence, de-escalation.
Regressions become PRsready to shipLangy drafts the prompt revision. Devs review and ship via Prompt Registry.

See Langy in the docs Watch a 90-sec demo

langy · live session

goal→plan→run→score→ship

pm · goal· pending

langy · plan· pending

langy · run· pending

langy · judge· pending

langy · ship· pending

median PM-to-PR 14 minuteswatch Langy work →

scenario · support-triage / candidate-2026.06

running · 1,247 / 2,000

user-simjudgered-teamvoice

scenariosimulatornpassverdict

1,247 conversations · 8,130 turns · 0 pass · 0 flagView triage queue →

Scenario

A thousand conversations before a single user.

LangWatch's open-source agent testing framework. Drop in to any agent in Python, TypeScript, or Go. Drive multi-turn flows yourself, or let UserSimulatorAgent play the user. JudgeAgent scores every turn. RedTeamAgent finds the edges.

Framework agnosticWraps any agent — LangGraph, CrewAI, Mastra, plain code.
Concurrent at scalePer-call isolation (ADR-001) — batch across projects.
Text, voice, adversarialSame suite, any modality. Voice loops + Crescendo built in.
MIT licensed

View on GitHub Read the docs

Live suite — Suite settles; the regression opens itself for evidence

Controls your security team will sign off on.

Production AI shouldn't ship without RBAC, audit trails, cost attribution and a key-revocation story. LangWatch makes those a first-class pillar — not a roadmap promise.

RBAC + REST APIsTeams, projects, API keys — scoped at the role level.
SCIM + SSOOkta / Azure AD / Google. Group → team auto-assignment.
Cost-center attributionSpend tracked across members, teams, projects.
Audit log → SIEMEvery prompt change, eval edit, key event — signed.

audit · workspace enron-prod · 24h

3,144 events · 0 anomalies

ActorActionTargetTime

skeeter@enron.comcreated api_keyprod-readonly · cost-center: shred-quarterly14:21:02

scim:gandalfassigned rolemaintainer · team Mordor/EU14:18:44

hank@globex.compromoted promptsupport-triage v12.1.0 → 100% traffic14:09:11

feature-flag-svcflipped flagevals.judge_v2 → prod · rollout 25%13:52:30

felicia@taylorswift.increvoked api_keystaging-token-3219 · "shake it off"13:40:18

patrick@pierce.comupdated evaluatorPII-guard · regex pattern13:11:55

signed · exportableStream to SIEM →

One endpoint. Every provider.

A drop-in proxy that speaks OpenAI- and Anthropic-compatible. Set a base URL, keep your existing SDK. Get automatic provider fallback, per-team budgets, Anthropic cache_control passthrough, and every request lands as a LangWatch trace.

OpenAI-compatibleNo SDK changes. Point base_url at the gateway.
Provider fallbackAutomatic spillover on rate limit, error or latency.
Per-team budgetsCost-center aware. Block or alert before spend.
cache_control passthroughAnthropic prompt caching honored end-to-end.

client.py

drop-in

from openai import OpenAI

client = OpenAI(
    base_url="https://gateway.langwatch.ai/v1",
    default_headers={"x-cost-center": "payments-eu"},
)

routing · last 1h · 14,202 req

fallback rate 0.4%

openaigpt-4o

60%

anthropicclaude-sonnet

30%

awsbedrock/claude-haiku

googlevertex/gemini-pro

Enterprise

Your data, your boundary, your rules.

Five ways to deploy — same product surface, same SDKs, same governance.

LangWatch Cloud

Managed multi-tenant SaaS — app.langwatch.ai, EU / US / UK / APAC regions.

Self-hosted · Docker

docker compose up — full stack on a single machine for evaluation.

Kubernetes · Helm

charts/langwatch + charts/gateway — production HA topology.

OnPrem · AWS · GCP · Azure

Marketplace templates for your VPC, customer-managed keys.

Hybrid (OnPrem Data)

Data plane on your infra. Control plane on ours. Strict residency.

SOC 2·ISO 27001·GDPR·SSO / SAML·RBAC·Audit logs·CMK

Trust center

Customers

Trusted by teams shipping mission-critical AI.

(Names changed to protect the no-longer-innocent.)

"LangWatch became our single source of truth for agent quality. Regressions get caught in evals before they ever reach a quarterly earnings call."

Skeeter McGee

Head of AI Reliability · Enron

"We went from "we hope it works" to a deploy gate backed by 800 evaluators. Product and engineering finally agree on what good means."

Hank Scorpio

CEO · Globex Corporation

"Scenario alone paid for the year. We replayed every release against a synthetic user set and caught a tool-routing bug 48 hours before rollout."

Patrick Bateman

VP, Mergers & Acquisitions · Pierce & Pierce

17.4B

tokens traced / month

240k

evals run / day

42 ms

p95 latency

99.99%

uptime SLA

Ready when you are

Ship agents with confidence.

Thirty minutes with a LangWatch solutions engineer, your stack, live, end to end.

Talk to an engineer or start free in 90 seconds

No credit card · Cloud · VPC · Self-hosted · Local

When your agentsget complex